In [1]:
execution_mode = 'manual'

Decision Tree Model

The first model to be trained with supervised learning is a Decision Tree Classifier, compare [JudACaps]. This chapter shows the training and the performance measurement of two Ensemble models. First, a simple Decision Tree Classifier is trained without cross-validation; then the classifier is statistically hardened with the help of cross-validation. As a second Ensemble classifier, a Random Forests classifier will be trained and its performance tested.

Data Takeover

As the first step, the data from previous chapters have to be read in as input for processing in this chapter.

In [2]:
import os
import pandas as pd
import bz2
import _pickle as cPickle

path_goldstandard = './daten_goldstandard'

# Restore results so far
df_labelled_feature_matrix = pd.read_pickle(os.path.join(path_goldstandard,
                                                         'labelled_feature_matrix.pkl'),
                                 compression=None)

# Restore DataFrame with features from compressed pickle file
with bz2.BZ2File((os.path.join(
    path_goldstandard, 'labelled_feature_matrix_full.pkl')), 'rb') as file:
    df_attribute_with_sim_feature = cPickle.load(file)

df_labelled_feature_matrix.head()
Out[2]:
coordinate_E_delta coordinate_N_delta corporate_full_delta doi_delta edition_delta exactDate_delta format_prefix_delta format_postfix_delta isbn_delta ismn_delta ... part_delta person_100_delta person_700_delta person_245c_delta pubinit_delta scale_delta ttlfull_245_delta ttlfull_246_delta volumes_delta duplicates
0 -1.0 -1.0 -1.0 -1.0 -1.0 0.75 1.0 1.0 1.0 -1.0 ... 1.0 1.0 1.0 1.000000 1.000000 -1.0 1.000000 -1.0 1.0 1
1 -1.0 -1.0 -1.0 -1.0 -1.0 0.75 1.0 1.0 1.0 -1.0 ... 1.0 1.0 -0.5 0.818905 0.848485 -1.0 0.787879 -1.0 1.0 1
2 -1.0 -1.0 -1.0 -1.0 -1.0 0.75 1.0 1.0 1.0 -1.0 ... 1.0 1.0 -0.5 0.697740 0.848485 -1.0 1.000000 -1.0 1.0 1
3 -1.0 -1.0 -1.0 -1.0 -1.0 0.75 1.0 1.0 1.0 -1.0 ... 1.0 1.0 -0.5 0.818905 0.848485 -1.0 0.787879 -1.0 1.0 1
4 -1.0 -1.0 -1.0 -1.0 -1.0 0.75 1.0 1.0 1.0 -1.0 ... 1.0 1.0 -1.0 1.000000 1.000000 -1.0 1.000000 -1.0 1.0 1

5 rows × 21 columns

In [3]:
print('Part of duplicates (1) and uniques (0) in units of [%]')
print(round(df_labelled_feature_matrix.duplicates.value_counts(normalize=True)*100, 2))
Part of duplicates (1) and uniques (0) in units of [%]
0    99.43
1     0.57
Name: duplicates, dtype: float64

Decision Tree Classifier

The Decision Tree is the basic building block of the family of Ensemble methods. Its advantage is its clarity: the trained model can be interpreted directly by looking at the resulting tree.

Train/Test Split for Decision Tree

The train/test split has been implemented as a general function $\texttt{.split}\_\texttt{feature}\_\texttt{target}()$ in a separate library called classifier_fitting_funcs.py. The function uses the library function $\texttt{sklearn.model}\_\texttt{selection.train}\_\texttt{test}\_\texttt{split}()$ from scikit-learn with the parameter $\texttt{stratify}$ in order to keep the distribution of the two classes in the split data the same as in the original data.

In [4]:
import classifier_fitting_funcs as cff

X_tr, X_val, X_te, y_tr, y_val, y_te, idx_tr, idx_val, idx_te = cff.split_feature_target(
    df_labelled_feature_matrix, 'train_validation_test')

X_tr[:5], y_tr[:5], idx_tr[:5]
Out[4]:
(array([[-1.        , -1.        , -1.        , -0.5       , -1.        ,
          0.625     ,  0.        ,  0.42857143,  1.        , -1.        ,
         -0.5       , -0.5       ,  0.49267677, -0.5       ,  0.54033531,
         -0.5       , -1.        ,  0.57608486, -1.        , -0.5       ],
        [-1.        , -1.        , -1.        , -1.        , -0.5       ,
          0.5       ,  0.        ,  0.42857143,  0.        , -1.        ,
         -1.        ,  0.        , -0.5       , -0.5       ,  0.50978836,
         -0.5       , -1.        ,  0.56688312, -1.        ,  0.51111111],
        [-0.5       , -0.5       ,  0.06      , -1.        , -1.        ,
          0.5       ,  0.        ,  0.42857143,  0.        , -1.        ,
         -1.        , -0.5       , -1.        , -1.        , -0.5       ,
         -0.5       , -0.5       ,  0.46245348, -0.5       , -0.5       ],
        [-1.        , -1.        , -0.5       , -1.        , -1.        ,
          0.625     ,  0.        ,  0.11111111,  1.        , -1.        ,
         -0.5       , -1.        , -0.5       , -0.5       ,  0.4047619 ,
          0.48095238, -1.        ,  0.4667789 , -1.        ,  0.55555556],
        [-1.        , -1.        ,  0.05      , -1.        , -0.5       ,
          0.25      ,  0.        ,  0.11111111,  0.        , -1.        ,
         -0.5       , -1.        , -0.5       ,  0.47484737,  0.4973368 ,
          0.56635908, -1.        ,  0.55436185, -1.        ,  0.77777778]]),
 array([0, 0, 0, 0, 0]),
 array([153040,  72045, 177429, 149431, 170753]))

The train/test split is done twice. The first split generates an intermediate set of data for training which consists of 80% and a set of data for testing which consists of 20% of the full data. The second split takes the intermediate training data as its basis and extracts an 80% set out of it which will be used for training the model. The remaining 20% of the intermediate training data will be used for validating the model during training. This strict separation of the data used for training from the data used for validating the model conforms to the basic principle of machine learning that any model is to be tested with unseen data. If this principle is violated and the test data is polluted with data the model has already seen during the training phase, the validation result runs the risk of being biased.
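The two-stage stratified split can be sketched with scikit-learn's $\texttt{train}\_\texttt{test}\_\texttt{split}()$ on toy data; the actual implementation lives in classifier_fitting_funcs.py, so the variable names and ratios below merely follow the description above:

```python
from sklearn.model_selection import train_test_split
import numpy as np

# Toy data: 100 records with a 90/10 class imbalance
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

# First split: 80% intermediate training data, 20% test data
X_im, X_te, y_im, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Second split: 80% of the intermediate data for training,
# 20% for validation during training
X_tr, X_val, y_tr, y_val = train_test_split(
    X_im, y_im, test_size=0.2, stratify=y_im, random_state=0)

# Stratification preserves the ~10% share of the positive class in each subset
for name, labels in (('train', y_tr), ('validation', y_val), ('test', y_te)):
    print(name, len(labels), labels.sum())
```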

In [5]:
print(X_tr.shape, y_tr.shape, X_val.shape, y_val.shape, X_te.shape, y_te.shape)
print('The test data set holds {:d} records of uniques and {:d} records of duplicates.'.format(
    len(y_te[y_te==0]), len(y_te[y_te==1])))
(166033, 20) (166033,) (41509, 20) (41509,) (51886, 20) (51886,)
The test data set holds 51591 records of uniques and 295 records of duplicates.

Model Training for Decision Tree

Grid search is to be done with the Decision Tree Classifier. The goal is to find the best parameter set for the classifier. First, the parameter ranges, i.e. the grid points in the grid space, are defined. The following code cell uses a global parameter $\texttt{execution}\_\texttt{mode}$ for controlling the size of the grid. Several run modes of this notebook are foreseen. The global parameter is set in the very first code cell of this notebook and can be overwritten from outside when the notebook is called by Overview and Summary. When called from outside, a larger range of the grid space shall be executed with the goal of getting a systematic result of the calculations. The execution of the notebook in its local mode is meant to finish quickly, just to get a basic idea of how the models behave.

In [6]:
if execution_mode == 'manual' :
    depths = list(range(2, 30, 2)) # The number of features is 20.
    depths.extend([35, 40, 50, None])
    parameter_dictionary = {
        'max_depth' : depths,
        'criterion' : ['gini'],
        'class_weight' : ['balanced']
    }
elif execution_mode == 'full' :
    # Find best parameters of Decision Tree
    depths = list(range(4, 32, 2))
    depths.extend([35, 40, 45, 50, None])
    parameter_dictionary = {
        'max_depth' : depths,
        'criterion' : ['gini', 'entropy'],
        'class_weight' : [None, 'balanced']
    }
elif execution_mode == 'restricted' :
    depths = list(range(16, 26, 2)) # The number of features is 20.
    depths.extend([None])
    parameter_dictionary = {
        'max_depth' : depths,
        'criterion' : ['gini', 'entropy'],
        'class_weight' : [None, 'balanced']
    }
elif execution_mode == 'tune' :
    # Tune parameters of Decision Tree
    depths = list(range(1, 31))
    depths.extend([35, 40, 45, 50, None])
    parameter_dictionary = {
        'max_depth' : depths,
        'criterion' : ['gini', 'entropy'],
        'class_weight' : ['balanced']
    }

# Grid of values
grid = cff.generate_parameter_grid(parameter_dictionary)
The grid parameters are ...
max_depth [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 35, 40, 50, None]
criterion ['gini']
class_weight ['balanced']
 => Number of combinations : 18

The Decision Tree Classifier is fitted with grid search with the help of a function $\texttt{.fit}\_\texttt{model}\_\texttt{measure}\_\texttt{scores}()$ implemented in library classifier_fitting_funcs.py. This function takes the model instance as parameter and returns the scores of the fitted model on the validation data.
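The library function itself is not shown in this notebook. A minimal sketch of what $\texttt{.fit}\_\texttt{model}\_\texttt{measure}\_\texttt{scores}()$ might look like, assuming it sets the parameters, fits the model, and returns the parameters together with the training and validation accuracies, could be (demonstrated here on the iris data set, not on the gold standard data):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def fit_model_measure_scores(model, params_dict, X_tr, y_tr, X_val, y_val):
    """Fit `model` with the given parameters and return its scores as a dict."""
    model.set_params(**params_dict)
    model.fit(X_tr, y_tr)
    scores = dict(params_dict)          # keep the parameters with the result
    scores['accuracy_tr'] = model.score(X_tr, y_tr)
    scores['accuracy_val'] = model.score(X_val, y_val)
    return scores

# Demonstration on the iris data set
X, y = load_iris(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

scores_example = fit_model_measure_scores(
    DecisionTreeClassifier(random_state=0), {'max_depth': 3},
    X_tr, y_tr, X_val, y_val)
print(scores_example)
```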

In [7]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=0)

# Save accuracy on test set
test_scores = []
for params_dict in grid :
    test_scores.append(cff.fit_model_measure_scores(dt, params_dict, X_tr, y_tr, X_val, y_val))

# Save measured accuracies
df_test_scores_dt = pd.DataFrame(test_scores).sort_values('accuracy_val', ascending=False)
Fitting with parameters {'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': 2}
 => validation score 96.198%
Fitting with parameters {'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': 4}
 => validation score 98.538%
Fitting with parameters {'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': 6}
 => validation score 99.248%
Fitting with parameters {'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': 8}
 => validation score 99.549%
Fitting with parameters {'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': 10}
 => validation score 99.807%
Fitting with parameters {'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': 12}
 => validation score 99.843%
Fitting with parameters {'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': 14}
 => validation score 99.899%
Fitting with parameters {'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': 16}
 => validation score 99.908%
Fitting with parameters {'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': 18}
 => validation score 99.913%
Fitting with parameters {'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': 20}
 => validation score 99.925%
Fitting with parameters {'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': 22}
 => validation score 99.925%
Fitting with parameters {'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': 24}
 => validation score 99.925%
Fitting with parameters {'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': 26}
 => validation score 99.925%
Fitting with parameters {'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': 28}
 => validation score 99.925%
Fitting with parameters {'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': 35}
 => validation score 99.925%
Fitting with parameters {'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': 40}
 => validation score 99.925%
Fitting with parameters {'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': 50}
 => validation score 99.925%
Fitting with parameters {'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': None}
 => validation score 99.925%

Naming the missing ($\texttt{None}$) entry of $\texttt{class}\_\texttt{weight}$ 'unbalanced' will make the test score data more readable.

In [8]:
ts_dict = {}

# kow = kind of weight
for kow in parameter_dictionary['class_weight']:
    ts_dict['unbalanced' if kow is None else kow] = [
        ts for ts in test_scores if ts['class_weight'] == kow]

Plotting the accuracy scores as a function of tree depth is a way of determining the best tree depth for a Decision Tree Classifier. Very often, the accuracy for the training data increases monotonically with increasing tree depth towards its maximum value. This monotonic increase of the accuracy score is a sign of overfitting to the training data. The accuracy scores calculated on the validation data are expected to show a different behaviour, though: validating the trained model with the validation data very often shows a distinct maximum, followed by a decrease of the accuracy score for higher values of tree depth. The tree depth with the maximum accuracy score on the validation data is interpreted as the best tree depth for the Decision Tree Classifier.

In [9]:
%matplotlib inline
import matplotlib.pyplot as plt
import results_analysis_funcs as raf

for kow in parameter_dictionary['class_weight'] :
    kind_of_weight = 'unbalanced' if kow is None else kow
    # Train data plot
    plt = raf.plot_accuracy(parameter_dictionary, ts_dict[kind_of_weight], 'accuracy_tr')
    # Validation data plot
    plt = raf.plot_accuracy(parameter_dictionary, ts_dict[kind_of_weight], 'accuracy_val')
    plt.ylabel('accuracy')
    plt.title(f'Measured accuracy on {kind_of_weight} train and validation data')
    plt.legend()
    plt.show()
    
    # Validation data plot
    plt = raf.plot_accuracy(parameter_dictionary, ts_dict[kind_of_weight], 'log_accuracy_val')
    plt.ylabel('log(1-accuracy)')
    plt.title(f'Measured accuracy on {kind_of_weight} validation data')
    plt.legend()
    plt.show()

The observation above does not show the expected effect for the validation accuracy. Even on a logarithmic scale, the accuracy merely reaches a constant maximum that persists with arbitrarily increasing tree depth. In this situation, the first tree depth which hits this maximum accuracy score is taken for the best Decision Tree Classifier model.
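Picking the first tree depth that reaches the maximum can be sketched as follows; the values are taken from the grid search output above, while the actual selection is done by $\texttt{.get}\_\texttt{best}\_\texttt{parameters}()$:

```python
# Validation scores ordered by increasing tree depth (excerpt from above)
scores = [
    {'max_depth': 16, 'accuracy_val': 0.99908},
    {'max_depth': 18, 'accuracy_val': 0.99913},
    {'max_depth': 20, 'accuracy_val': 0.99925},
    {'max_depth': 22, 'accuracy_val': 0.99925},
    {'max_depth': None, 'accuracy_val': 0.99925},
]

best_acc = max(s['accuracy_val'] for s in scores)
# The list is ordered by increasing depth, so the first hit wins
best = next(s for s in scores if s['accuracy_val'] == best_acc)
print(best['max_depth'])  # → 20
```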

In [10]:
best_params = cff.get_best_parameters(test_scores, parameter_dictionary)

dt_best = DecisionTreeClassifier(criterion=best_params['criterion'],
                                 max_depth=best_params['max_depth'],
                                 class_weight=best_params['class_weight'], random_state=0)

dt_best.fit(X_tr, y_tr)
The parameters for the best model are ...
max_depth = 20
criterion = gini
class_weight = balanced
Out[10]:
DecisionTreeClassifier(ccp_alpha=0.0, class_weight='balanced', criterion='gini',
                       max_depth=20, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=0, splitter='best')

Having 20 features in the feature matrix, a maximum tree depth of 20 is a fair result. Let's have a look at the graph of the Decision Tree.

In [11]:
! pip install graphviz
Requirement already satisfied: graphviz in /Users/andreas/anaconda3/lib/python3.7/site-packages (0.13.2)
In [12]:
path_tree_graphics = './documentation'

# Path for Decision Tree
decision_tree_dot = os.path.join(path_tree_graphics, 'decision_tree.dot')
decision_tree_png = os.path.join(path_tree_graphics, 'decision_tree.png')

from sklearn.tree import export_graphviz

# Export decision tree
dot_data = export_graphviz(
    dt_best, out_file=decision_tree_dot,
    feature_names=df_labelled_feature_matrix.drop(columns=['duplicates']).columns,
    class_names=['unique', 'duplicate'],
    filled=True, rounded=True, proportion=True
)

# Generate image in .png format
! dot -Tpng $decision_tree_dot -o $decision_tree_png
In [13]:
from IPython.display import Image
Image(decision_tree_png)
Out[13]:

Counting the layers of the tree confirms its depth of 20.

Performance Measurement of Decision Tree

The confusion matrix is used for testing the performance of the classifier [ConfMatr], see figure 1. In the confusion matrix, the records of class duplicate are the positive case, while the records of class unique are the negative case. The true negatives (uniques) and the true positives (duplicates) are the correctly classified predictions, where "correct" means correct according to the classification in the provided test data set. The false negatives are the records that the model predicts as uniques but that the test data classifies as duplicates. The false positives are the records that the model predicts as duplicates but that the test data classifies as uniques.

In [14]:
from sklearn.metrics import confusion_matrix

y_pred_dt = dt_best.predict(X_te)

confusion_matrix(y_te, y_pred_dt)
Out[14]:
array([[51573,    18],
       [   11,   284]])

The specific numbers above depend on the parameters used for the model calculation and on the number of records used for training and for testing.

  • The left number in the first row is the true negative ($tn$) of the confusion matrix while ...
  • the right number in the first row is the false positive ($fp$).
  • The left number in the second row is the false negative ($fn$) and ...
  • the right number in the second row is the true positive ($tp$).
Figure 1 Confusion matrix based on [ConfMatr].

The explicit assessment of the specific figures will be done in Overview and Summary depending on the specific parameters used for a run.

In [15]:
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score

print('Score {:.3f}%'.format(100*dt_best.score(X_te, y_te)))
print('Area under the curve {:.3f}% - accuracy {:.3f}% - precision {:.3f}% - recall {:.3f}%'.format(
    100*roc_auc_score(y_te, y_pred_dt),
                100*accuracy_score(y_te, y_pred_dt),
                100*precision_score(y_te, y_pred_dt),
                100*recall_score(y_te, y_pred_dt)
               ))
Score 99.944%
Area under the curve 98.118% - accuracy 99.944% - precision 94.040% - recall 96.271%

The confusion matrix allows for calculating some characteristic numbers [ConfMatr].

  • The accuracy is the ratio of the correctly predicted cases and the total number of records in the data $$acc = \frac{tp+tn}{p+n}.$$ The accuracy is equal to the score value in the output above.
  • The precision is the ratio of true positives and the number of records of class duplicate in the data $$ppv = \frac{tp}{tp+fp}.$$
  • The recall (or sensitivity) is the ratio of true positives and the number of real positive records in the data $$tpr = \frac{tp}{tp+fn}.$$
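The numbers from the confusion matrix can be plugged into these formulas to check the scores printed above:

```python
# Entries of the confusion matrix from Out[14]
tn, fp, fn, tp = 51573, 18, 11, 284

acc = (tp + tn) / (tp + tn + fp + fn)   # accuracy
ppv = tp / (tp + fp)                    # precision
tpr = tp / (tp + fn)                    # recall

print('accuracy  {:.3f}%'.format(100 * acc))   # 99.944%
print('precision {:.3f}%'.format(100 * ppv))   # 94.040%
print('recall    {:.3f}%'.format(100 * tpr))   # 96.271%
```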

The prediction $y_{pred}$ of a classifier for a record of the training or test data set is based on the prediction probability $y_{pred}^{probability}$, a tuple of two numbers in the closed interval from 0 to 1 whose elements sum to 1 $$y_{pred}^{probability} = (a, b) \texttt{ with } a, b \in [0, 1] \texttt{ and } a+b = 1.$$ Function $\texttt{.predict}()$ of the classifier uses a value of 0.5 to assign a record uniquely to either class. This value of 0.5 is the default threshold of the classifier. To get $y_{pred}^{probability}$, the model's function $\texttt{.predict}\_\texttt{proba}()$ can be called. With the resulting raw probability tuple, the threshold can be adjusted. The effect of varying the threshold is a shift in the allocation of records among the quadrants of the confusion matrix, which is equivalent to a change of the characteristic numbers. Modifying the threshold value allows for tuning a model with the goal of maximizing a desired characteristic number, for example the precision. Increasing one characteristic number will decrease other characteristic numbers like the accuracy, though.
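The effect of shifting the threshold can be illustrated with a small made-up example (the probabilities and labels below are invented for demonstration purposes):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up probabilities of class duplicate and the corresponding true labels
y_proba_demo = np.array([0.10, 0.35, 0.55, 0.80, 0.95])
y_true_demo  = np.array([0,    0,    1,    1,    1])

results = {}
for threshold in (0.5, 0.9):
    # A record is predicted as duplicate when its probability reaches the threshold
    y_pred_demo = (y_proba_demo >= threshold).astype(int)
    results[threshold] = confusion_matrix(y_true_demo, y_pred_demo).ravel()
    print(threshold, results[threshold])  # order: tn, fp, fn, tp
```

Raising the threshold from 0.5 to 0.9 moves two former true positives into the false negatives: the classifier becomes stricter about predicting duplicates, which lowers the recall.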

In [16]:
y_proba = pd.Series(dt_best.predict_proba(X_te)[:,1])
y_proba[(y_proba>0) & (y_proba<1)] # Empty Series means no result
Out[16]:
Series([], dtype: float64)

Unfortunately, the Decision Tree Classifier exclusively predicts probability tuples of the kind $(0,1)$ for duplicates and $(1,0)$ for uniques. Therefore, the effect of changing the threshold cannot be illustrated with this classifier. This will be made up for below with the Random Forests classifier.

With the notion of the threshold, one more characteristic number can be explained. The roc auc (area under the receiver operating characteristic curve) is derived from a graphical plot of the true positive rate $tpr$ versus the false positive rate $$fpr = \frac{fp}{fp+tn}$$ at various settings of the threshold [rocauc]. The value of the roc auc may vary between 0 and 1. A classifier that does not generate any relevant information has a value of 0.5. The closer the roc auc value is to 1, the better the prediction quality of the classifier.
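A small sketch with made-up scores shows how the roc auc results from the $(fpr, tpr)$ pairs collected at the different threshold settings:

```python
import numpy as np
from sklearn.metrics import auc, roc_curve, roc_auc_score

# Made-up scores: three uniques, three duplicates
y_true_demo  = np.array([0, 0, 0, 1, 1, 1])
y_score_demo = np.array([0.1, 0.4, 0.6, 0.5, 0.8, 0.9])

# ROC curve: one (fpr, tpr) point per threshold setting
fpr, tpr, thresholds = roc_curve(y_true_demo, y_score_demo)

auc_from_curve = auc(fpr, tpr)                          # area under the curve
auc_direct = roc_auc_score(y_true_demo, y_score_demo)   # same value directly
print(auc_from_curve, auc_direct)
```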

With these characteristic numbers derived from the confusion matrix, the prediction performance of a classifier is quantified. The comparison of the characteristic numbers of different classifiers will produce a ranking in chapter Overview and Summary. The ranking metric for assessing the overall best model of all calculated models is to remain the accuracy. If the accuracy happens to be equal for two different models, then the roc auc will be considered as a second metric.

This kind of fine ranking with another metric has to be pointed out. Within a model, the best classifier is ranked first using the accuracy score. When comparing and ranking the models among each other, the metric for assessing the rank remains the accuracy score. Adding the roc auc value for fine ranking brings metric numbers like precision and recall into additional consideration. This augmented information on a model's performance is the motivation for this kind of fine ranking. Unfortunately, there is no guarantee that the roc auc value, as a balanced mixture of several scoring values, holds the best possible value for the model with the best accuracy. This weakness is accepted for this capstone project, though.
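The fine ranking amounts to a two-key sort; the model names and score values below are hypothetical placeholders for the results collected in Overview and Summary:

```python
import pandas as pd

# Hypothetical per-model results
df_models = pd.DataFrame({
    'model':    ['decision_tree', 'decision_tree_cv', 'random_forest'],
    'accuracy': [0.99944, 0.99944, 0.99940],
    'roc_auc':  [0.98118, 0.98455, 0.97900],
})

# Primary ranking by accuracy, ties broken by roc auc
ranking = df_models.sort_values(['accuracy', 'roc_auc'], ascending=False)
print(ranking['model'].tolist())
```

The two models with identical accuracy are separated by their roc auc values, so the cross-validated tree ends up first in this made-up example.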

In [17]:
# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(df_attribute_with_sim_feature.columns)

df_attribute_with_sim_feature.iloc[idx_te].sort_index().sample(n=5)
Out[17]:
duplicates coordinate_E_delta coordinate_E_x coordinate_E_y coordinate_N_delta coordinate_N_x coordinate_N_y corporate_full_delta corporate_full_x corporate_full_y doi_delta doi_x doi_y edition_delta edition_x edition_y exactDate_delta exactDate_x exactDate_y format_postfix_delta format_postfix_x format_postfix_y format_prefix_delta format_prefix_x format_prefix_y isbn_delta isbn_x isbn_y ismn_delta ismn_x ismn_y musicid_delta musicid_x musicid_y part_delta part_x part_y person_100_delta person_100_x person_100_y person_245c_delta person_245c_x person_245c_y person_700_delta person_700_x person_700_y pubinit_delta pubinit_x pubinit_y scale_delta scale_x scale_y ttlfull_245_delta ttlfull_245_x ttlfull_245_y ttlfull_246_delta ttlfull_246_x ttlfull_246_y volumes_delta volumes_x volumes_y
116545 0 -1.0 -1.0 -1.0 -1.0 -1.0 0.500 1970aaaa 19841992 0.428571 010200 030100 0.0 mu vm 1.0 [] [] -1.0 -0.5 4553 -1.0 -0.500000 mozartwolfgang amadeus 0.488550 wolfgang amadeus mozart ; text von emanuel sch... sigrid kessler ... [et al.] -0.500000 grubergernot, orelalfred, moehnheinz, schikane... -0.500000 bärenreiter -1.0 0.649175 die zauberflöte, eine deutsche oper in zwei au... bonne chance!, cours de langue française : exi... -1.0 0.000000 1 221 7
50494 0 -1.0 -1.0 -1.0 -1.0 -1.0 0.250 1982aaaa 2015uuuu 1.000000 020000 020000 1.0 bk bk 0.0 [] [978-0-7294-1156-1] -1.0 -1.0 -0.5 13 -0.500000 voltaire 0.624699 sigrid kessler ... [et al.] voltaire ; [sous la dir. de diego venturino] 0.536630 kesslersigrid venturinodiego 0.501981 interkantonale lehrmittelzentrale, staatlicher... voltaire foundation -1.0 0.509901 bonne chance!, cours de langue française : deu... siècle de louis xiv (iv), chapitres 13-24 -1.0 0.777778 4 453
82135 0 -1.0 -1.0 -0.5 les arts florissants -1.0 -1.0 0.250 1996aaaa 2005uuuu 1.000000 040100 040100 1.0 mu mu 1.0 [] [] -1.0 0.4 0630 92633 -0.5 41 42 620 1.000000 mozartwolfgang amadeus mozartwolfgang amadeus 0.794444 wolfgang amadeus mozart wolfgang amadeus mozart ; [libretto von emanue... 0.609052 mannionrosa, dessaynatalie, kitchenlinda, panz... schikanederemanuel -0.500000 brillant classics -1.0 0.752668 die zauberflöte, kv 620 : opera in two acts = ... die zauberflöte, [oper in zwei aufzügen], kv 620 -1.0 0.714286 2 2 74 77
173198 0 -1.0 -1.0 -1.0 -1.0 -0.5 3 0.625 2015aaaa 2012uuuu 0.428571 020000 020053 1.0 bk bk 0.0 [978-3-648-07838-9, 3-648-07838-0] [978-2-253-15933-9] -1.0 -1.0 -0.5 208 0.623131 basuandreas austenjane 0.610334 andreas basu ; liane faust jane austen -0.500000 faustliane 0.419608 haufe le livre de poche -1.0 0.526667 gewaltfreie kommunikation emma -1.0 0.777778 128 1
16349 0 -1.0 -1.0 -0.5 fossi, annibale (venezia) -1.0 -1.0 0.500 aaaaaaaa 1487uuuu 0.428571 010200 020053 0.0 mu bk 1.0 [] [] -1.0 -1.0 -1.0 0.000000 mozartwolfgang amadeus eusebius -0.500000 von w.a. mozart ; klavierauszug neu rev. von w... -0.500000 augustinusaurelius, cyrillus -0.500000 annibale fossi -1.0 0.550713 die zauberflöte, oper in 2 akten = il flauto m... epistola ad damasum de morte hieronymi, episto... -1.0 -0.500000 1 167

In the confusion matrix, the false positives and the false negatives are the wrongly predicted records. One way of tuning a classifier may be to use different kinds of similarity metrics for an attribute. It is crucial to look at the wrongly predicted records to get an idea of the effect of the similarity metrics used. This analysis, together with an improvement of the data records, has been done iteratively in the course of the capstone project. Some of the analysis will be illustrated in chapter Overview and Summary. To do so, all wrongly predicted records need to be stored in order to hand them over to the summary chapter. This is done with the help of a specific library function $\texttt{.add}\_\texttt{wrong}\_\texttt{predictions}()$.

In [18]:
import results_saving_funcs as rsf

idx = {}
(idx['true_predicted_uniques'], idx['true_predicted_duplicates'],
 idx['false_predicted_uniques'],
 idx['false_predicted_duplicates']) = raf.get_confusion_matrix_indices(y_te, y_pred_dt)

wrong_prediction_groups = ['false_predicted_uniques', 'false_predicted_duplicates']

for i in wrong_prediction_groups :
    rsf.add_wrong_predictions(path_goldstandard, 
                              dt, i, df_attribute_with_sim_feature.iloc[idx_te].iloc[idx[i]])

The performance measurement described in this subsection will be repeated for all the models to come. For those models, the presentation will focus exclusively on the code and leave out any additional description.

Decision Tree Classifier with Cross-Validation

In order to reach a model with strong statistical stability, cross-validation can be used when training the model. This section will use a $\texttt{GridSearchCV}$ object from scikit-learn for this purpose.

Train/Test Split for Decision Tree CV

When doing cross-validation, the training data is split into training and validation data by the $\texttt{GridSearchCV}$ object from scikit-learn. Therefore, it is sufficient to split the original data into a train and a test data set without any additional splitting of the train data.

In [19]:
X_tr, _, X_te, y_tr, _, y_te, idx_tr, _, idx_te = cff.split_feature_target(
    df_labelled_feature_matrix, 'train_test')

X_tr[:5], y_tr[:5], idx_tr[:5]
Out[19]:
(array([[-1.        , -1.        , -0.5       , -1.        , -1.        ,
          0.25      ,  0.        ,  0.42857143,  0.        , -1.        ,
          0.16666667, -1.        , -0.5       , -0.5       ,  0.53888889,
          0.47991021, -1.        ,  0.59978811, -1.        ,  0.78333333],
        [-1.        , -1.        , -1.        , -1.        , -1.        ,
          0.4375    ,  0.        ,  0.11111111,  1.        , -1.        ,
         -0.5       , -1.        ,  1.        ,  0.57605284,  0.59184563,
          0.41919192, -1.        ,  0.7332472 , -1.        ,  0.        ],
        [-1.        , -1.        ,  0.05      , -1.        , -1.        ,
          0.25      ,  1.        ,  1.        ,  1.        , -1.        ,
         -1.        , -1.        , -0.5       ,  0.52608873,  0.61453149,
          0.41568627, -1.        ,  0.51855227, -1.        ,  0.        ],
        [-1.        , -1.        , -1.        , -1.        , -1.        ,
          0.5       ,  1.        ,  0.42857143,  0.        , -1.        ,
         -1.        ,  0.61111111,  0.55357143, -0.5       ,  0.49804219,
         -0.5       , -1.        ,  0.64228804, -1.        ,  0.51111111],
        [-1.        , -1.        , -1.        , -1.        , -1.        ,
          0.25      ,  1.        ,  0.42857143,  0.        , -1.        ,
         -1.        ,  0.        , -1.        , -0.5       ,  0.50943557,
          0.45171958, -1.        ,  0.6121175 , -1.        ,  0.        ]]),
 array([0, 0, 0, 0, 0]),
 array([  7686, 251455, 121736,  30480, 184004]))
In [20]:
print(X_tr.shape, y_tr.shape, X_te.shape, y_te.shape)
print('The test data set holds {:d} records of uniques and {:d} records of duplicates.'.format(
    len(y_te[y_te==0]), len(y_te[y_te==1])))
(207542, 20) (207542,) (51886, 20) (51886,)
The test data set holds 51591 records of uniques and 295 records of duplicates.

Model Training for Decision Tree CV

The grid search for the Decision Tree Classifier with cross-validation will be done with the same parameter space as for the Decision Tree Classifier without cross-validation. In this way, the effect of cross-validation becomes obvious.

In [21]:
from sklearn.model_selection import GridSearchCV
import numpy as np

# Create cross-validation object with DecisionTreeClassifier
grid_cv = GridSearchCV(DecisionTreeClassifier(random_state=0),
                       param_grid=parameter_dictionary, cv=5,
                       verbose=1)

# Fit estimator
grid_cv.fit(X_tr, y_tr)

# Get the results with 'cv_results_', get parameters with their scores
params = pd.DataFrame(grid_cv.cv_results_['params'])
scores = pd.DataFrame(grid_cv.cv_results_['mean_test_score'], columns=['accuracy_val'])
log_scores = pd.DataFrame(-np.log(1-grid_cv.cv_results_['mean_test_score']), columns=['log_accuracy_val'])
scores_std = pd.DataFrame(grid_cv.cv_results_['std_test_score'], columns=['std_accuracy_val'])

# Create a DataFrame of (parameters, score, std) pairs
df_test_scores_dtcv = params.merge(scores, how='inner', left_index=True, right_index=True)
df_test_scores_dtcv = df_test_scores_dtcv.merge(
    scores_std, how='inner', left_index=True, right_index=True).sort_values(
    'accuracy_val', ascending=False)
df_test_scores_dtcv = df_test_scores_dtcv.merge(
    log_scores, how='inner', left_index=True, right_index=True)
Fitting 5 folds for each of 18 candidates, totalling 90 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  90 out of  90 | elapsed:   50.5s finished
In [22]:
df_test_scores_dtcv.sort_values(by='accuracy_val', ascending=True)
Out[22]:
class_weight criterion max_depth accuracy_val std_accuracy_val log_accuracy_val
0 balanced gini 2.0 0.951354 0.022701 3.023190
1 balanced gini 4.0 0.985839 0.007500 4.257260
2 balanced gini 6.0 0.988720 0.003438 4.484756
3 balanced gini 8.0 0.993148 0.002125 4.983269
4 balanced gini 10.0 0.996642 0.001312 5.696305
5 balanced gini 12.0 0.997731 0.000566 6.088232
6 balanced gini 14.0 0.998410 0.000418 6.443998
7 balanced gini 16.0 0.998786 0.000392 6.713661
8 balanced gini 18.0 0.999003 0.000147 6.910370
9 balanced gini 20.0 0.999147 0.000193 7.066940
10 balanced gini 22.0 0.999215 0.000176 7.149339
11 balanced gini 24.0 0.999287 0.000145 7.245876
13 balanced gini 28.0 0.999287 0.000152 7.245876
14 balanced gini 35.0 0.999287 0.000152 7.245876
15 balanced gini 40.0 0.999287 0.000152 7.245876
16 balanced gini 50.0 0.999287 0.000152 7.245876
17 balanced gini NaN 0.999287 0.000152 7.245876
12 balanced gini 26.0 0.999292 0.000153 7.252656

The validation accuracy can be plotted as a function of the tree depth.

In [23]:
ts_dict = {}

# Reorder on index for x-axis
df_test_scores_dtcv.sort_index(inplace=True)

for kow in parameter_dictionary['class_weight']:
    ts_dict['unbalanced' if kow is None else kow] = [
        ts for ts in df_test_scores_dtcv.to_dict('records')
            if ts['class_weight'] == kow]

for kow in parameter_dictionary['class_weight'] :
    kind_of_weight = 'unbalanced' if kow is None else kow
    # Train and validation data plot
    plt = raf.plot_accuracy(parameter_dictionary, ts_dict[kind_of_weight], 'accuracy_val')
    plt.ylabel('accuracy')
    plt.title(f'Measured accuracy on {kind_of_weight} train and validation data')
    plt.legend()
    plt.show()
    
    # Validation data plot
    plt = raf.plot_accuracy(parameter_dictionary, ts_dict[kind_of_weight], 'log_accuracy_val')
    plt.ylabel('log(1-accuracy)')
    plt.title(f'Measured accuracy on {kind_of_weight} validation data')
    plt.legend()
    plt.show()

For the $\texttt{GridSearchCV}$ object, the best estimator can be retrieved via the attribute $\texttt{best}\_\texttt{estimator}\_$. The parameters of the best estimator tree are shown below. They confirm the graphs above.

In [24]:
dtcv_best = grid_cv.best_estimator_
dtcv_best
Out[24]:
DecisionTreeClassifier(ccp_alpha=0.0, class_weight='balanced', criterion='gini',
                       max_depth=26, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=0, splitter='best')

Let's have a look at the tree of the best estimator.

In [25]:
# Path for Decision Tree
decision_tree_cv_dot = os.path.join(path_tree_graphics, 'decision_tree_cv.dot')
decision_tree_cv_png = os.path.join(path_tree_graphics, 'decision_tree_cv.png')

# Export decision tree
dot_data = export_graphviz(
    dtcv_best, out_file=decision_tree_cv_dot,
    feature_names=df_labelled_feature_matrix.drop(columns=['duplicates']).columns,
    class_names=['unique', 'duplicate'],
    filled=True, rounded=True, proportion=True
)

# Generate image in .png format
! dot -Tpng $decision_tree_cv_dot -o $decision_tree_cv_png
In [26]:
Image(decision_tree_cv_png)
Out[26]:

Performance Measurement of Decision Tree CV

The confusion matrix is used on the test data set for performance analysis, see subsection Performance Measurement of Decision Tree.

In [27]:
y_pred_dtcv = dtcv_best.predict(X_te)

confusion_matrix(y_te, y_pred_dtcv)
Out[27]:
array([[51573,    18],
       [   10,   285]])

The scoring figures will be assessed in chapter Overview and Summary.

In [28]:
print('Score {:.3f}%'.format(100*dtcv_best.score(X_te, y_te)))
print('Area under the curve {:.3f}% - accuracy {:.3f}% - precision {:.3f}% - recall {:.3f}%'.format(
    100*roc_auc_score(y_te, y_pred_dtcv),
                100*accuracy_score(y_te, y_pred_dtcv),
                100*precision_score(y_te, y_pred_dtcv),
                100*recall_score(y_te, y_pred_dtcv)
               ))
Score 99.946%
Area under the curve 98.288% - accuracy 99.946% - precision 94.059% - recall 96.610%

The prediction probability tuples $y_{pred}^{probability}$ report only the values $(1, 0)$ and $(0, 1)$, as seen in subsection Performance Measurement of Decision Tree.

In [29]:
y_proba = pd.Series(dtcv_best.predict_proba(X_te)[:,1])
y_proba[(y_proba>0) & (y_proba<1)] # Empty Series means no result
Out[29]:
Series([], dtype: float64)
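These pure 0/1 probabilities are a direct consequence of a sufficiently deep tree: every leaf ends up containing samples of a single class. The effect can be reproduced on a synthetic toy data set, independent of the notebook's data (a minimal sketch, not part of the pipeline above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic toy data; without a depth limit the tree grows until all leaves are pure
X, y = make_classification(n_samples=200, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# On pure leaves, every class probability is exactly 0 or 1
proba = tree.predict_proba(X)[:, 1]
print(np.isin(proba, [0.0, 1.0]).all())  # True
```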

The last step of the performance measurement subsection is to persist the wrongly classified records for full assessment in chapter Overview and Summary.

In [30]:
idx = {}
idx['true_predicted_uniques'], idx['true_predicted_duplicates'], idx['false_predicted_uniques'], idx['false_predicted_duplicates'] = raf.get_confusion_matrix_indices(y_te, y_pred_dtcv)

wrong_prediction_groups = ['false_predicted_uniques', 'false_predicted_duplicates']

for i in wrong_prediction_groups :
    rsf.add_wrong_predictions(path_goldstandard, 
                              dtcv_best, i, df_attribute_with_sim_feature.iloc[idx_te].iloc[idx[i]], '_CV')

Random Forests

Another Ensemble method is Random Forests. The results of this classifier will be presented in this section.

Train/Test Split for Random Forests

The train/test split for Random Forests is done in the same way as for the Decision Tree classifier, with the goal of obtaining three distinct data sets: one for training, one for validation, and one for performance testing.

In [31]:
X_tr, X_val, X_te, y_tr, y_val, y_te, idx_tr, idx_val, idx_te = cff.split_feature_target(
    df_labelled_feature_matrix, 'train_validation_test')

X_tr[:5], y_tr[:5], idx_tr[:5]
Out[31]:
(array([[-1.        , -1.        , -1.        , -0.5       , -1.        ,
          0.625     ,  0.        ,  0.42857143,  1.        , -1.        ,
         -0.5       , -0.5       ,  0.49267677, -0.5       ,  0.54033531,
         -0.5       , -1.        ,  0.57608486, -1.        , -0.5       ],
        [-1.        , -1.        , -1.        , -1.        , -0.5       ,
          0.5       ,  0.        ,  0.42857143,  0.        , -1.        ,
         -1.        ,  0.        , -0.5       , -0.5       ,  0.50978836,
         -0.5       , -1.        ,  0.56688312, -1.        ,  0.51111111],
        [-0.5       , -0.5       ,  0.06      , -1.        , -1.        ,
          0.5       ,  0.        ,  0.42857143,  0.        , -1.        ,
         -1.        , -0.5       , -1.        , -1.        , -0.5       ,
         -0.5       , -0.5       ,  0.46245348, -0.5       , -0.5       ],
        [-1.        , -1.        , -0.5       , -1.        , -1.        ,
          0.625     ,  0.        ,  0.11111111,  1.        , -1.        ,
         -0.5       , -1.        , -0.5       , -0.5       ,  0.4047619 ,
          0.48095238, -1.        ,  0.4667789 , -1.        ,  0.55555556],
        [-1.        , -1.        ,  0.05      , -1.        , -0.5       ,
          0.25      ,  0.        ,  0.11111111,  0.        , -1.        ,
         -0.5       , -1.        , -0.5       ,  0.47484737,  0.4973368 ,
          0.56635908, -1.        ,  0.55436185, -1.        ,  0.77777778]]),
 array([0, 0, 0, 0, 0]),
 array([153040,  72045, 177429, 149431, 170753]))
In [32]:
print(X_tr.shape, y_tr.shape, X_val.shape, y_val.shape, X_te.shape, y_te.shape)
print('The test data set holds {:d} records of uniques and {:d} records of duplicates.'.format(
    len(y_te[y_te==0]), len(y_te[y_te==1])))
(166033, 20) (166033,) (41509, 20) (41509,) (51886, 20) (51886,)
The test data set holds 51591 records of uniques and 295 records of duplicates.

Model Training for Random Forests

The parameters of a Random Forests classifier differ from those of the Decision Tree Classifier. This is due to differences in the algorithms; see the scikit-learn documentation for details.
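The two parameter sets can be compared directly via $\texttt{get}\_\texttt{params()}$. The following quick check (not part of the notebook's pipeline) lists the hyper-parameters that only the forest has, such as $\texttt{n}\_\texttt{estimators}$ and $\texttt{bootstrap}$:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rf_keys = set(RandomForestClassifier().get_params())
dt_keys = set(DecisionTreeClassifier().get_params())

# Hyper-parameters specific to the ensemble, absent on a single tree
print(sorted(rf_keys - dt_keys))
```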

In [33]:
if execution_mode == 'manual' :
    depths = [18, 20, 22]
    depths.append(None)
    parameter_dictionary = {
        'n_estimators' : [50, 75, 100],
        'max_depth' : depths,
        'class_weight' : [None]
    }
elif execution_mode == 'full' :
    depths = list(range(10, 30, 2))
    depths.append(None)
    parameter_dictionary = {
        'n_estimators' : [8, 16, 32, 64, 128],
        'max_depth' : depths,
        'class_weight' : [None, 'balanced']
    }
elif execution_mode == 'restricted' :
    depths = [18, 20, 22, 24]
    depths.append(None)
    parameter_dictionary = {
        'n_estimators' : [128],
        'max_depth' : depths,
        'class_weight' : [None]
    }
elif execution_mode == 'tune' :
    # Tune random forest classifier
    depths = list(range(16, 27))
    parameter_dictionary = {
        'n_estimators' : list(range(70, 125, 5)),
        'max_depth' : depths,
        'class_weight' : [None, 'balanced']
    }

# Grid of values
grid = cff.generate_parameter_grid(parameter_dictionary)
The grid parameters are ...
n_estimators [50, 75, 100]
max_depth [18, 20, 22, None]
class_weight [None]
 => Number of combinations : 12
In [34]:
from sklearn.ensemble import RandomForestClassifier

# Create random forest
rf = RandomForestClassifier(random_state=0) # Leave impurity measure on default value 'gini'

# Save accuracy on test set
test_scores = []
for params_dict in grid :
    test_scores.append(cff.fit_model_measure_scores(rf, params_dict, X_tr, y_tr, X_val, y_val))

# Save measured accuracies
df_test_scores_rf = pd.DataFrame(test_scores).sort_values('accuracy_val', ascending=False)
Fitting with parameters {'class_weight': None, 'max_depth': 18, 'n_estimators': 50}
 => validation score 99.940%
Fitting with parameters {'class_weight': None, 'max_depth': 18, 'n_estimators': 75}
 => validation score 99.937%
Fitting with parameters {'class_weight': None, 'max_depth': 18, 'n_estimators': 100}
 => validation score 99.940%
Fitting with parameters {'class_weight': None, 'max_depth': 20, 'n_estimators': 50}
 => validation score 99.954%
Fitting with parameters {'class_weight': None, 'max_depth': 20, 'n_estimators': 75}
 => validation score 99.947%
Fitting with parameters {'class_weight': None, 'max_depth': 20, 'n_estimators': 100}
 => validation score 99.947%
Fitting with parameters {'class_weight': None, 'max_depth': 22, 'n_estimators': 50}
 => validation score 99.957%
Fitting with parameters {'class_weight': None, 'max_depth': 22, 'n_estimators': 75}
 => validation score 99.952%
Fitting with parameters {'class_weight': None, 'max_depth': 22, 'n_estimators': 100}
 => validation score 99.959%
Fitting with parameters {'class_weight': None, 'max_depth': None, 'n_estimators': 50}
 => validation score 99.957%
Fitting with parameters {'class_weight': None, 'max_depth': None, 'n_estimators': 75}
 => validation score 99.952%
Fitting with parameters {'class_weight': None, 'max_depth': None, 'n_estimators': 100}
 => validation score 99.952%

The Random Forests parameters for the best model are shown below.

In [35]:
best_params = cff.get_best_parameters(test_scores, parameter_dictionary)

# Create a random forest with the best parameters
rf_best = RandomForestClassifier(n_estimators=best_params['n_estimators'],
                                 max_depth=best_params['max_depth'],
                                 class_weight=best_params['class_weight'],
                                 random_state=0
                                )

# Fit estimator
rf_best.fit(X_tr, y_tr)
The parameters for the best model are ...
n_estimators = 100
max_depth = 22
class_weight = None
Out[35]:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=22, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

Performance Measurement of Random Forests

The confusion matrix and the scoring values for the model are shown below.

In [36]:
y_pred_rf = rf_best.predict(X_te)

confusion_matrix(y_te, y_pred_rf)
Out[36]:
array([[51576,    15],
       [   12,   283]])
In [37]:
print('Score {:.3f}%'.format(100*rf_best.score(X_te, y_te)))
print('Area under the curve {:.3f}% - accuracy {:.3f}% - precision {:.3f}% - recall {:.3f}%'.format(
    100*roc_auc_score(y_te, y_pred_rf),
                100*accuracy_score(y_te, y_pred_rf),
                100*precision_score(y_te, y_pred_rf),
                100*recall_score(y_te, y_pred_rf)
               ))
Score 99.948%
Area under the curve 97.952% - accuracy 99.948% - precision 94.966% - recall 95.932%

As described and expected in subsection Performance Measurement of Decision Tree, the prediction probability tuples $y_{pred}^{probability}$ report values within the open interval $(0, 1)$.

In [38]:
y_proba = pd.Series(rf_best.predict_proba(X_te)[:,1])
rf_best.predict_proba(X_te)[(y_proba>0) & (y_proba<1)]
Out[38]:
array([[9.90000000e-01, 1.00000000e-02],
       [9.99302326e-01, 6.97674419e-04],
       [9.20000000e-01, 8.00000000e-02],
       ...,
       [3.00000000e-02, 9.70000000e-01],
       [9.99971264e-01, 2.87356322e-05],
       [9.99681409e-01, 3.18590705e-04]])

Moving the threshold away from its default value results in modified values of the confusion matrix and of the scoring figures, as described in subsection Performance Measurement of Decision Tree.

In [39]:
threshold = 0.1 # Modify threshold value (default is 0.5) => Tune model
y_pred_threshold = y_proba.apply(lambda x: 1.0 if x >= threshold else 0.0)
confusion_matrix(y_te, y_pred_threshold)
Out[39]:
array([[51500,    91],
       [    6,   289]])
In [40]:
print('Original score with default threshold : {:.3f}% (see above)'.format(100*rf_best.score(X_te, y_te)))
print('Area under the curve {:.3f}% - accuracy {:.3f}% - precision {:.3f}% - recall {:.3f}%'.format(
    100*roc_auc_score(y_te, y_pred_threshold),
                100*accuracy_score(y_te, y_pred_threshold),
                100*precision_score(y_te, y_pred_threshold),
                100*recall_score(y_te, y_pred_threshold)
               ))
Original score with default threshold : 99.948% (see above)
Area under the curve 98.895% - accuracy 99.813% - precision 76.053% - recall 97.966%
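Instead of probing single threshold values by hand, the whole precision/recall trade-off can be swept at once with $\texttt{precision}\_\texttt{recall}\_\texttt{curve}$. The following sketch uses synthetic, imbalanced toy data (loosely mimicking the uniques/duplicates skew); it is an illustration, not the notebook's pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: roughly 95% negatives, 5% positives
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# Precision and recall for every candidate threshold at once
precision, recall, thresholds = precision_recall_curve(y_te, proba)

# Example selection rule: smallest threshold that still keeps recall >= 0.9
mask = recall[:-1] >= 0.9
best_threshold = thresholds[mask][0]
print(f'chosen threshold: {best_threshold:.3f}')
```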

Finally, the wrongly predicted records for the Random Forests classifier need to be persisted for final assessment in the summary chapter. The prediction for the default threshold is taken.

In [41]:
idx = {}
idx['true_predicted_uniques'], idx['true_predicted_duplicates'], idx['false_predicted_uniques'], idx['false_predicted_duplicates'] = raf.get_confusion_matrix_indices(y_te, y_pred_rf)

wrong_prediction_groups = ['false_predicted_uniques', 'false_predicted_duplicates']

for i in wrong_prediction_groups :
    rsf.add_wrong_predictions(path_goldstandard, 
                              rf_best, i, df_attribute_with_sim_feature.iloc[idx_te].iloc[idx[i]])

Model Interpretation of Random Forests

For Random Forests, the attribute $\texttt{feature}\_\texttt{importances}\_$ returns an array indicating the importance of each feature. The higher the value, the more important the feature.

In [42]:
x_ticks = df_labelled_feature_matrix.drop(columns=['duplicates']).columns

plt.figure(figsize=(12,4))
plt.bar(x_ticks, rf_best.feature_importances_, color='red')
for i in range(len(x_ticks)):
    plt.text(i-0.6, 2/10, f'{rf_best.feature_importances_[i]*100:.2f}%',
             color='black', rotation=30, fontsize=13)
plt.xticks(rotation='vertical')
plt.title('Feature importance')
plt.xlabel('feature')
plt.ylabel('normed importance value')
plt.show()

The feature importance of an attribute is correlated with the degree of filling of this attribute, see chapter Data Analysis. Beyond that, its value gives insight into the role an attribute similarity plays for a pair of records. The value may differ for varying similarity metrics applied to one and the same attribute. The feature importance is therefore an indicator for controlling the choice of similarity metrics.
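Since impurity-based importances can be biased, e.g. toward well-filled attributes, a common cross-check is permutation importance computed on held-out data. The sketch below uses synthetic data and is not part of the notebook's pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle one feature at a time on the test set and measure the score drop
result = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
print(result.importances_mean)  # one mean importance value per feature
```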

Summary

This chapter has trained the first models for predicting the class of unknown test records. The calculated models belong to the family of Ensemble classifiers. The performance of each model has been measured, and the measurement procedure used throughout has been introduced and explained with the very first model, the Decision Tree Classifier. The models of this chapter will be compared with the results of the Dummy Classifier of chapter Features Discussion and Dummy Classifier Baseline and with all additional models to come. The assessment will be done in chapter Overview and Summary.

Results Handover

The final results will be assessed on the same test data for all three models of this chapter. The train/test split need not be repeated here, as all train/test split calls in this chapter have generated the same test data set, due to fixing $\texttt{random}\_\texttt{state}=0$. The results of this chapter still have to be persisted.
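That a fixed seed really yields identical test sets can be illustrated with a minimal sketch (assuming the split helper relies on $\texttt{train}\_\texttt{test}\_\texttt{split}$ internally, which is an assumption here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(100)

# Two independent split calls with the same random_state yield the same test set
_, test_1 = train_test_split(data, test_size=0.2, random_state=0)
_, test_2 = train_test_split(data, test_size=0.2, random_state=0)
print(np.array_equal(test_1, test_2))  # True
```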

In [43]:
path_results = './results'

rsf.add_result_to_results(path_results,
                          df_test_scores_dt, dt_best, X_te, y_te, y_pred_dt)
rsf.add_result_to_results(path_results,
                          df_test_scores_dtcv, dtcv_best, X_te, y_te, y_pred_dtcv, '_CV')
rsf.add_result_to_results(path_results,
                          df_test_scores_rf, rf_best, X_te, y_te, y_pred_rf)
In [ ]: